SPECTRAL CONDITION NUMBERS OF ORTHOGONAL PROJECTIONS 
AND FULL RANK LINEAR LEAST SQUARES RESIDUALS* 

JOSEPH F. GRCARt 

Abstract. A simple fomiula is proved to be a tight estimate for the condition number of the full rank linear least 
squares residual with respect to the matrix of least squares coefficients and scaled 2-norms. The tight estimate reveals 
that the condition number depends on three quantities, two of which can cause ill-conditioning. The numerical linear 
algebra literature presents several estimates of various instances of these condition numbers. All the prior values 
exceed the formula introduced here, sometimes by large factors. 
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1. Introduction. 

1.1. Purpose. Least squares residuals are quite important numerically. The residuals 
measure the quality of fits in regression analysis, and forming orthogonal projections is an 
essential step in many iterative algorithms for linear equations or matrix eigenvalues. 

This paper determines a tight estimate for the condition number of the residual in full 
rank least squares problems. Equivalently, the condition number of orthogonal projections 
into the span of linearly independent vectors is also estimated. The condition numbers are 
with respect to the matrix of least squares coefficients and with respect to scaled 2-norms. The 
condition number of the residual, like the solution, is the value of an optimization problem 
that does not have an explicit formula but which does have a tight estimate. 

This introduction provides some background material. Section|2]discusses the evaluation 
of condition numbers from Jacobian matrices. Section [3] describes the tight estimate of the 
condition number and provides an example; this material is appropriate for classroom pre- 
sentation. Section[4]proves that the condition number varies from the estimate within a factor 
of a/2; the linear algebra is complicated but straightforward given an identity from a previous 
paper (Grcar 2009). Section|5]compares the results to the literature. Section|6]discusses the 



application to projections and to iterative algorithms. 

1.2. Prior Work. Conditioning with respect to perturbations of the matrix A is the most 
interesting aspect of least squares problems, 

a; = argmin ||6 — Au||2 r ~ b — Ax . (1-1) 

u 

The condition numbers Xxi^), of x with respect to A for various norms, have been studied 



in dozens of papers and books since Golub and Wilkinson ( jl966 ) discovered the condition 



number for 2-norms can depend on the square of the matrix condition number. Thus, it was 
equally surprising when Bjorck ( 1967', p. 16, eqn. 7.7) discovered the conditioning of the 



residual is independent of the square]^ Even so, Bjorck's original formula turned out to be an 
overestimate. Roughly the same formula is still found in many textbooks (section [573]l. 



Wedin ( 1973 p. 224, eqn. 5.4) derived a perturbation bound for the residual with respect 
to A again for 2-norms. This paper shows that Wedin's bound contains an estimate for Xr{A) 
that is accurate within a factor of 2. Wedin noted that his perturbation bound could be "almost 
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attained" (p. 225). However, (p. 226) he also remarked that his paper only demonstrated near 
attainment for a perturbation bound on the least squares solution (not the residual). Thus the 
published hterature has no prior proof of attainment for an error bound of the residual. 

|Geurts|fl982) and |Gratton| ( | 1 996) have used Jacobian matrices to derive condition num- 
bers, or estimates of condition numbers, for least squares solutions. Their results and those 
of |Bjorck| ^1967) , [Maly shev (TOOT), and |Wedin| ( [T973] l for the condition number of the solu- 
tion are summarized by Grcar (2009). There has been no similar determination of condition 
numbers based on Jacobian matrices for the residual. The spectral condition number of the 
residual, like the solution, is the value of an optimization problem that does not have an 
explicit formula but which does have a tight estimate. No tight estimates for the condition 
number of the least squares residual have been established previously. 

2. Condition numbers. 

2.1. Error bounds and definitions of condition numbers. "Perturbation bounds" are 
used in numerical analysis to limit the sensitivity of the solution of a problem to changes in the 
initial data. Such bounds are customarily derived using matrix-vector algebra and norms; the 
coefficients of the data perturbations in these bounds are sometimes referred to as condition 



numbers. For example, in one of the earliest books on rounding error analysis, Wilkinson 



( 1963[ p. 29) wrote "we shall refer to [the coefficients] as condition numbers " Many 



numerical analysts probably agree with Wilkinson in the interest of deriving error bounds, but 
the name "condition number" is used sparingly because the coefficients are only upper limits 
for condition numbers unless the error bounds are the smallest possible, equivalently, unless 



the error bounds are attained. Malyshev ( |2003[ p. 1 187) observed, "the bounds are commonly 



accepted as condition numbers, and any discussion about their sharpness is usually avoided." 

The oldest way to derive perturbation bounds is by differential calculus. If y = f{x) 
is a vector valued function of the vector x whose partial derivatives are continuous, then the 
partial derivatives give the best estimate of the change to y for a given change to x 

Ay^f{x + Ax)-fix)Kjf{x)Ax (2.1) 

where Jf{x) is the Jacobian matrix of the partial derivatives of y with respect to x. The 
magnitude of the error in the first order approximation ( |2.1| i is bounded by Landau's little 
o(|j Aa;||) for all sufficiently small || Aa;|||^Thus Jf{x) Ax is the unique linear approximation 
to Ay in the vicinity of a; Taking norms produces a perturbation bound, 

||Ay|| < ||J^(x)|| ||Ax|| +o(||Aa;||). (2.2) 



Equation (2.2 1 is the smallest possible bound on \\Ay\\ in terms of || Ax|| provided the norm 



for the Jacobian matrix is induced from the norms for Ay and Aa;. In this case for each x 



there is some Ax, which is nonzero but may be chosen arbitrarily small, so the bound (2.2 1 is 



attained to within the higher order term, o(|| Aa;||). There may be many other ways to define 



condition numbers, but because equation (2.2i is the smallest possible bound, any definition 



of a condition number for use in bounds equivalent to (2.2 1 must arrive at the same value, 
Xyi^) — \\Jf{x)\\^ The matrix norm may be too complicated to have an explicit formula, 
but tight estimates can be derived as in this paper. 



^The o(|| Ax|| ) agreement is independent of the norm because all norms for finite dimensional spaces are equiv- 
alent jStewart and Sun| |1990| p. 54, thm. 1.7). 
^Any other linear function added to J fix 



^Any other linear function added to Jf{x) Ax differs from Ay by 0(1] Aa;||) and therefore does not provide a 

o(||Ax|| ) approximation. 

■*A theory of condition numbers in terms of Jacobian matrices was developed by |Rice| jl966| p. 292, thm. 4). 
See also |Trefethen and Bau||T997| p. 90) for the present definition. 
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2.2. One or separate condition numbers. Many problems depend on two parameters 
u, V which may consist of the entries of a matrix and a vector (for example). In principle it is 
possible to treat the parameters altogether]^ A condition number for y with respect to joint 
changes in u and v requires a common norm for perturbations to both. Such a norm is 

max{||AM||, llAwll} . (2.3) 

A single condition number then follows that appears in an optimal error bound, 

llAyll < \\Jf{u,v)\\ max{||Au||, ||Aw||} + o (max { || A-it|| , ||Au||}) . (2.4) 

The Jacobian matrix Jf{u, v) contains the partial derivatives of y = f{u, v) with respect to 
the entries of both u and v. The value of the condition number is again Xyi'^--, v) = \\Jf{u, v)\\ 
where the matrix norm is induced from the norm for Ay and the norm in equation ( |2.3| l. 

Because u and v may enter into the problem in much different ways, it is customary to 
treat each separately. This approach recognizes that the Jacobian matrix is a block matrix 

.Jf{u,v) = [ J/,(u) Jf^{v) ] 

where the functions fi{u) — f{u, v) and /2(u) = f{u, v) have v and u fixed, respectively. 
The first order differential approximation ( |2.1| i is unchanged but is rewritten with separate 
terms for u and v, 

Ay « J/i (u) Au + J/, {v) Av , (2.5) 

and a perturbation bound is obtained by applying the triangle inequality, 

||Ay|| < \\Jf,{u)Au + Jf^{v)Av\\ + o (max { || Au||, ||Au||}) 

< \\Jf,{u)\\ ||Au|| + ||J;,(i;)|| ||A«|| +o(max{||A7.||, \\Av\\}) . (2.6) 

The coefficients Xi/(^) = ll^/il"")!! ^"d Xyi'v) — ||'^/2(^)ll the separate condition num- 
bers of y with respect to u and v, respectively. 

These two different approaches lead to error bounds ( |2.4| |2.6[ ) that differ by at most a 
factor of 2. This fact is a property of induced norms. Consider a p x (m + rt) block matrix 



y=[A B] 



and suppose norms are given for MP, M™ and M" as spaces of column vectors. A norm can 
be defined for as 



max 



{\M,\\v\\}. 



These norms for Rp, M"", M", and induce norms for A, B, and [A B ], 

11.11 ll^""!! linn nr. niii \\Au + Bv\\ 

||A|| = max Y-rr , = max \-rp , || [ A S ] || = max " " 



u^o \\u\\ ' v^o \\v\\ ' " '- " «5^o or -u^^o max {||ii||, ||w|| } 



^As will be discussed, |Gratton]jl996f derived a joint condition number of the least squares solution with respect 
to a Frobenius norm of the matrix and vector that define the problem. 

""Gratton ( 1996 1 derived a joint condition number of the least squares solution with respect to a Frobenius norm 
of the matrix and vector that define the problem. 
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The norm of the block matrix has a simple upper bound, 

\\Au + Bv\\ 

max r i 

u^o or v^o max <^\\u\\, \\v\\j 

\\M\ 



\[^ B]\ 



< 



\Bv\ 



u^o or v^o max 1 1 



max ■ 

\\A\\ 



\\B\ 



+ max ■ 



^lli ll'^^ll} or -u^o max I ||u|| , ||f||} 

\Bv\\ 
M 



(2.7) 



and a simple lower bound. 



ll^ll 



max ■ 



\Au\\ 



max 



\Au + Bv\\ 



< 



u^Oand-u=o max{||u||, ||w||} 

II + Bv\\ 



11^0 or u^o max {||u|| . ||u||} 



|[^ ^]| 



and similarly ||i?|| < ||[ ^ ^ ]ll. Altogether, from equations (2.7 

PI 



2.8 1 



B 



<max{P||, ||B||} < B ]\\ < \\A\\ 



\B\\ 



(2.8) 



(2.9) 



which means that ||yl|| + ||i?|| overestimates \\[ A _B ]|| by at most a factor of 2. Returning 
to the Jacobian matrices A = Jf^ (u), B = Jf^ (w), and [A B ] — Jf{u, v), equation (2.9 1 
can be rewritten 



Xyiu) + Xyiv) 



< 



Xyiu,v) < Xy{u)+Xy{v)- 



(2.10) 



Thus, for the purpose of deriving tight estimates of joint condition numbers, it suffices to 
consider Xy{u) and Xy{v) separately. 

3. Conditioning of the least squares residual. 

3.1. Reason for considering full rank problems. For any matrix A and any similarly 
sized column vector b, the linear least squares problem ( |l.l| i need not have an unique solution 
X, but it always has an unique residual r = b — Ax = {I — P)b where P — AA^ is the 
orthogonal projection into the column space of A, col (A), and where A''^ is the pseudoinverse 
of ^. If A has full column rank, then P = A{A* A)~''- A* . Changes to A affect r differently 
when A does not have full column rank. In that case, b £ col (A + for every nonzero 
right null vector z, so small changes to A can produce large changes to r. In other words, a 
condition number of r with respect to rank deficient A does not exist or is "infinite." Pertur- 
bation bounds in the rank deficient case can be found by restricting changes to the matrix, for 
which see iBj5rck| ( |l996| p. 30, eqn. 1.4.27) and [Stewart and Sun| ( [19901 PP- 136-162). That 
theory is beyond the scope of the present discussion. 

3.2. The condition numbers. This section summarizes the results and presents an ex- 
ample. Proofs are in section]?] It is assumed that A has full column rank and neither the 
solution X nor the residual r of the least squares problem are zero. The residual is proved to 
have a condition number Xr (^) with respect to A within the limits. 




< xAA) < 




(3.1) 
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The quantities K2, 9, and V are written bold to emphasize they are the only quantities affecting 
the tight estimate of the condition number; they are defined below. There is also a condition 
number with respect to b. 



xAb) = csc(0) 



(3.2) 



These are condition numbers when the following scaled 2-norms are used to measure the 
perturbations to A, b, and x, 



\AA\\ 



|A6||: 



lArll 



11^112 ' II&II2 ' ||r||2 ■ 

Like equation (2.6 1, the two condition numbers appear in error bounds of the formj^ 



where r + Ar is the residual of the perturbed problem, 

X + Ax = arg min \\{b + Ab) — {A 



\\AA\\2 

Uh ■ 

AA)u\\: 



\Abh 
\\bh 



The quantities in the formulas (3.1 3.2 1 are 

11^112 .....^ \\Ax\\2 



cot(6>) 



\Ax\[ 



\r\\2 



\X\\2 0-n 



(3.3) 



(3.4) 



(3.5) 



(3.6) 



where is the spectral matrix condition number of A (CTmin is the smallest singular value of 
A), V is van der Sluis's ratio between 1 and /^aj^and 9 is the angle between b and col(A). 

1 . K2 depends only on the extreme singular values of A. 

2. 9 depends only on the "angle of attack" of b with respect to col (A). 

3. If A is fixed, then V depends on the orientation of b to col (A) but not on 9^ 

Please refer to Figure 3.1 as needed. If col (A) is fixed, then /tg and 9 are separate sources 
of ill-conditioning for the residual. The ratio V can never cause ill-conditioning because it 
only appears in the denominator of equation (3.1 1 and V is always at least 1. Indeed, if Ax 
has comparatively large components in singular vectors corresponding to the largest singular 
values, then V ss K2 and V might lessen the ill-conditioning caused by a small 9. 

3.3. Conditioning example. This example illustrates the effects of V, and 9 on 
Xr{A). It is based on the example of Golub and Van Loan ( 1996 p. 238). Let 





"10" 




/3cos(0) 







0" 


A^ 


a 


, b = 


/? sin(<^) 


, AA^ 















1 




— e 


— e 



where < e ^ a, /3, and a < 1. In this example. 



(3cos{(j)) 
i sin(0) 



Ar 



/3cos(0) + ^sin(0) 



'The constant denominators ||A||2 and ||fe|l2 could be discarded from the o terms because only the order of 
mag nitude of the ter ms is pertinent. 

^Van der Sluis|(l975| p. 251) introduced no notation. The Greek letters that look and sound like English v are 
V and /3, respectively, so it seems best to choose Roman V for van der Sluis. 

'Because A has full column rank, Ax and x can only vary proportionally when their directions are fixed. 
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The three terms in the condition number are 

/«2 = - , V = I cot(0) = /3. 

a ^[acos(0)]2 + [sin(0)]2 

These values can be manipulated by choosing a, /? and The tight upper bound on the 
condition number with respect to A is 



XAA) < - ./l + [a/3cos((/))]2 + [/3sin(,/))] 
a V 



The relative change to the residual 

l|Ar||2 _ 1 



^l + a^ + [a(3 cos(0) + /3 sin(0)]2 e + ©(e^) ^ 



||r||2 a 

can made be close to the bound on Xri^) times || AAII2/II A||2 ~ y/2e. These formulas have 
been verified using Mathematica ( Wolfram") |2003 1, as have formulas throughout the paper. 

4. Derivation of tlie condition number estimates. 

4.1. Choice of Norms. In theoretical numerical analysis especially for least squares 
problems the 2-norm is preferred because for it the matrix condition number of A* A is the 
square of the matrix condition number of A. The norms used in this paper and in many other 
papers are defined as, 

||vec(AA)||..^, IIMII,.^, IIAHk^"^. ,4., 

where the choice of scale factors is left open. The scaling makes the size of the changes 
relative to the particular problem of interest. The scaling used in equations ( 3. 1 - 3.3 1 is 

A=\\A\\2, 6=||6||2, 7^=||r||2. (4.2) 

Some authors prefer to measure the residual relative to b by choosing TZ = \\b\\2- Other 
authors have no scaling, A — B = TZ = 1. All of these cases are accommodated by the 



notation in equation (4.1 1. The effect of the choice for TZ is discussed in section 6.1 



4.2. Notation. The formula for the Jacobian matrix Jr{b) of the residual r = [I — 
A{A*A)^^A*]b with respect to b is clearp"] For derivatives with respect to the entries of A, 

'"The notation of section \22\ would introduce a name, /2, for tlie function by which r varies with b when A is 
held fixed, r = f2{b), so that the notation for the Jacobian matrix is then Jf^ (b). This pedantry will be discarded 
here to write Jr (6) for the matrix of partial derivatives of r with respect to b with A held fixed. 
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it is necessary to use the "vec" construction to order the matrix entries into a column vector; 
vec(i?) is the column of entries Bi j with i,j in co-lexicographic orderp] The first order 
approximation ]2.5\ is then 

Ar = Jr[vec{A)] vec(Ayl) + Jr{b) Ab + higher order terms in and Ab (4.3) 

and upon taking norms 

II Ar||K < II J,.[vec(^)] vec{AA)\\n + || J,(6) A6||k + o (. . . ) 

< II J.[vec(AA)]|| ||AA|U + II J,(b)|| ||A6||b + o (. . . ) (4.4) 
Xr{A) Xr{b) 

where the norms of the two Jacobian matrices are induced from || • Hk, || • and from || • Ht^, 
II • lie, respectively. The high order term in equation ( |4.4| i is o(max{|| AA||_4, ||A5||e}) 
because from equation (2.2 1 the norm max{|| • m, || • ||b} has been given to the space that 
jointly consists of matrices A and vectors b. 

4.3. Condition number of r with respect to b. For the orthogonal projection P defined 



m section 



3.1 from r = (/ - P)b follows Jr{b) = / 



||J.(6) Ab\\n 

\\Jr(b)\\ = max ,, . , , = max 

" ^ ^" A6 ||A6||e Afc 



P, hence 

|(/-P)Ab||2 

n 



B 

n 



(4.5) 



which is equation (3.2 1 for the choice of scale factors in equation (4.2 1. 

4.4. Condition number of r with respect to A. Evaluating the condition number of the 
residual requires a formula for the Jacobian matrix Jr[vec(74)]. Differentiating the entries of 

r= [I - A{A'Ay^A']b 

by those of A seems to be a daunting task. Instead, Jr[vec(A)] is constructed from the total 
differential of the identity, 



0. 



" / 


A' 




r 




' b' 


A' 







X 








Assuming b is fixed because it already has been treated in section [43] the total differential is 

Xil X2I 



I A 
A' 



dr 
dx 



Xjil 



eir* 62^* 



w&c{dA) = 0, 



where Xi is the i-th entry of x and where is the z-th column of the n x n identity matrix. 
Hence 



dr 




' I A' 


-1 r 


dx 




A* 








I- 


P 






(A'A) 





Jr[vec{A)] 
J^[vec(A)] 



Xll X2I ■ ■ 

eir* e2r* • • 
A{A*A)-^ 

\ec{dA) 



Xnl 



\ec{dA) 



Xil X2I 

eir* 62^* 



Xjil 



we.c{dA) 



(4.6) 



' ' The alternative to placing the entries of matrices into column vectors is to use more general linear spaces and 
the Frechet derivative. That approach seems unnecessarily abstract because the spaces have finite dimension. 
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in which P — A{A*A)^^A''^ is the orthogonal projection into the column space of A. The 
two matrix blocks in equation (4.6i are the Jacobian matrices of r and x as functions of the 
entries of A with b held fixed. 

4.5. IVanspose formula for condition numbers. The desired condition number is the 



norm induced from the norms in equation ( 4. 1 



I JrfvecfA)!!! = max 

AA 



A 



— max 
7^ AA 



\Jr[vec{A)] vec(AA)||K 
||vec(AA)|U 

J,.[vec(A)] vec(AA)||2 



||AA||. 



The numerator and denominator are vector and matrix 2-norms, respectively. If A is an m x n 
matrix, then this maximization is a large problem with mn degrees of freedom. The identity 
for the norm of the transposed operator can be applied to reduce the degrees to m, 



, .r ..MM ||J^[vec(A)]*Ar||^ 
|Jr[vec(A)]|| = — max " ^ ^ " 



7^ Ar 



lAr 



(4.7) 



Here, the identical norm for the transposed Jacobian matrix is induced from the duals of the 
2-norms for matrices and vectors. The vector 2-norm is its own dual. The dual of the matrix 



2-norm is determined in Grcar ( 2009) to be the sum of the singular values of the matrix, 
including multiplicities. This norm is sometimes called the nuclear norm or the trace norm. 

4.6. Condition number of r with respect to A, continued. The application of equation 
( |4.7| l requires the evaluation of the matrix-vector product in the numerator Note that for any 
vectors r' and x'. 



Xil X2I 



Xfil 



vec[r'x* + r (x')*] 
With this identity it is now possible to compute, from equation (|4.6|l. 



- t 






■ r' ' 




x' 



{J^[vec(A)]}*Ar = 


' Jr[vec{A)] ' 
_ J^[vec(A)] 


t 


' Ar " 







Xil X2I 










eir* e2r* 




■ enT 






Xil X2I 










eir* 627"* 






= vec(uj^wj + U2 





I-P 

{A*A)-^A*' 



A{A^A)-^ 





' Ar " 








(/- P)Ar 
{A'Ay^A' Ar 



in which 



Ui 
U2 



{I - P)Ar 



Vl 



V2 = Ar 



(4.8) 



Thus equation (|4.7|l is the following optimization. 



vec(A) = — max " ^ / 



A 

— max 

7^ l|Ar|U = l 



(4.9) 
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For ease of notation, let g{Ar) be the objective function in equation (4.9 1. In Grcar 
( |2009| ) it is shown that 



g{Ar) = 



Ui 



Hi ll^'illi + Ml 1^2111 + 2 ||ui||2 Wvih \\u2h \\v2h cos(6>„ - 6»„) , 



(4.10) 



where 0„ is the angle between ui and U2, and By is the angle between vi and V2, and both 
angles should be taken from to tt. Evaluating the maximum has two parts. 

The first step shows Ar can be restricted so that cos{6u — Oy) > 0. The vector Ar 
always could be decomposed into a component in col (A) and a component orthogonal to 
this subspace. Let the component inside col (A) be a'. Further, the component outside can 
be decomposed into components parallel to r and orthogonal to r, say 7r + r' for some 
coefficient 7. With these choices to express Ar, 



Ar 



7r 



where a G col (A), r' _L col (A), r' L r 



the vectors in equation (4.8 1 are 



Ml 
U2 



r 



Vl — X 



V2 



{A'A)-^A'a' 



and the angles are 



cos(0„) 



M1M2 



7lkll2 



Ikilb IIW2II2 v/t^IHT+FI 



cos{6v) = 



V\V2 



x\A'A)~^A*a' 



.,112 



I«i||2lk2||2 \\x\\2\\{AtA)-^A*a'h 



Thus the sign of 7 affects only the angle 0„ in equation ( |4.10[ l, so it can be chosen to place 0„ 
in the same quadrant as By (either from to 7r/2, or from 7r/2 to tt) and hence cos(0„ — 0u) > 
0. This means the maximum of equation (|4.7|i can be restricted to those Ar for which 



L(Ar) 



VlKllilkl|li+h2||i 11^2111 < 5(Ar) < ||Mi||2||^l||2 



k2||2||«2||2 (4.11) 

= C/(Ar) . 



The second step chooses Ar to maximize the upper bound U{Ar). As before, the vector 
Ar always can be decomposed into a component in col {A) and a component in the orthogonal 
complement. Without loss of generality, assume Ar = cos{(j))r" + sin(0)a" where a" and 
r" are unit vectors in col (A) and the complement, respectively, and where th e coe fficients 
are determined by an angle between and 7r/2[^ The vectors in equation (4.8 1 for this 
representation of Ar are 



Ml = cos(0)r" 
M2 = r 



Vl 



W2 =sin((/))(yl*A)-iA*a" 



'^The coefficients cos(</i) and sin(0) are non-negative so the choice of ij> does not affect the choice of sign 
needed for equation |4.1 1|. 
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The largest ||w2||2 occurs when a" is a left singular vector for the smallest singular value of 
A, cTmini in which case V2 = (sin(0)/crniin) a"; altogether 



[/(Ar) = cos(0) ||a:;||2 + sin((/)) 



The maximum of this formula with respect to (j) determines an optimal Arbnd where the upper 
bound is 




C/(Arbnd) 



The maximum has been verified using Mathematica ( jWolfram] |2003| l. 



(4.12) 



The formula in equation (4.12 1 is the maximum of the upper bounds, which is not to say 
it is the maximum of equation (4.7 1. The objective function g and the lower and upper bounds 
L and U, when evaluated at Arbnd and Armax, must be aiTanged as follows, 



i(Arb„d) < 5(Arb„d) < 5(Ar„,ax) < C/(An„ax) < C/(Arb„d). 



These inequalities have the following justifications: (a) equation ( |4.11 
(c) equation (4. 12 1, and (d) choice of Arbnd 



the maximum. From the formula for L(Ar) in equation (4.11 



(b) choice of Arn 



Therefore equation (4.12 1 is an upper bound for 
the upper bound is at most 



\/2 times larger than a lower bound for the maximum. Note that to complete the limits and 



the condition number, these values must be scaled by the coefficient A/TZ in equation (4.7 1 



4.7. Summary of condition numbers. 

Theorem 4.1 (Spectral condition numbers). For the full rank linear least 
squares problem with solution x — {A^A)^^A^b and residual r = b — Ax, and for the 
scaled norms in equation (|4.i|) with scale factors A, B, and TZ, 




Xr{b) 



xU < xAA) < - 




(4.13) 



where CTmin is the smallest singular value of A. These formulas simplify to those in section 
\3.2\for the choice of scale factors in equation \4.2\. 



Proof. Section 4.3 derives Xr{b), and sections 4.4-4.6 derive the bounds on Xr{A). □ 



5. Comparison with published bounds. 

5.1. The estimate of Wedin. Table 15.11 lists the condition number estimates in some 
textbook error bounds for the least squares residual. All the values exceed the upper estimate 
of theorem |4~T] to varying degrees. 

The very early formula of Wedin | ( |1973| p. 224, eqn 5.4) is also reported in the more 
recent textbook of Bjorck ( 19961 p. 30, eqn. 1.4.27). It is for perturbations only to A, that is 
for the choice A6 = 0, and for the choice of scale factors A = TZ = 1. The value exceeds 
the estimate in theorem 4.1 by at most the factor \/2, so it is at most double the condition 
number The other two values in Table 15. ll can be severe overestimates. 
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Table 5.1 

Condition number estimates in textbook error bounds for the least squares residual. The full rank least squares 
problem is rainx \\b — Ax\\2, the solution is x, the residual is r, the smallest nonzero singular value of A is cr^^^, 
the condition number of A is = ||A||2/a",„in. 





norms and scale factors 


maximum overestimation factor 


source 


data 


residual 


estimate for Xr{A)** 




Wedin ( 


1973 


p. 224, 


||AA||2 




l|Ar||2 

7^ = 1 


'-'mill 


2 


eqn. 5.4),|BjorckH1996| 
p. 30, eqn. 1.4.27) 


Stewart 


1977 


p. 655), 


||AA||2 


Ah = Q 


l|Ar||2 

7^ = 1 






Stewart and Sun|l 19901 


p. 160, sec. 5.2) 


Golub and Van Loan 


f||A/l||2 \\Mh\ 


||Ar||2 




K.2 


jl996| p. 242, eqn. 
5.3.9),|Iligham 12002 
p. 382, eqn. 20.2)" 


"'n uh ' ii^ii. / 

A=\\A\\2 B=\\h\\2 


\\bh 
71= ||6||2 


Theoren 
equation 


i4.1 

4.1. 


i 


||AA||2 




||Ar||2 




V2 




A 


n 



** The formula of Golub and van Loan and of Higham amounts to an estimate for Xr {A) + Xr (b) 
and is compared against the sum of Xr(b) and the tight estimate for Xr{A). See section[53] 



5.2. The estimate of Stewart. The value of |Stewart| ( |1977 
Stewart and Sun] ( [T990l p. 160, sec. 5.2) 

Some assembly is required. Let B = A + AA be the perturbed matrix. Assume 



p. 655) is also reported by 
It again is for choices Ab = and A = TZ = 1. 

AAh < 



Cmin SO that B also has full rank. 

For any matrix M, let Pm — MM^ be the orthogonal projection into the column space 
of M. With Ah — the difference between the residuals of the original and the perturbed 
problems ( |l.l[|3.5| l is A?- = (/ — Pb)6 — (/ — PAjb so it is always true that 



lAr||2 < \\PA-PBh\\h\\ 



(5.1) 



Pb 1 1 2 is to be bounded by applying an earlier result. 

into 



Stewart| ( [T977] p. 655) remarks that \\Pa 

He does not intend \\Pa — Pb\\2 < 1 (p. 651, eqn. 4.1) which converts equation (5.1 
the useless ||Ar||2 < ||6||2 



AA\ 



2 ^ ||w||2- Stewart means a complicated expression that introduces 
into the bound. This expression requires some preparation that is more easily followed in the 
presentation of |Stewart and"Sunl ( |T99 0' pp. 160, 153, 148, 137). 

Continuing the assembly of the bound, let the singular value decomposition of A be 



[Ui U2]'av = 



Ai 




where [ J7i, U2] and V are square orthonormal matrices and where Ai is the square diagonal 



matrix of singular values. Let the corresponding factorization of AA be ( Stewart and Sun 
[19901 p. 137) 



[ Ui U2 ]^AAV = 



El 
E2 



12 
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where \\E,\\2 < ||AA||2 fori = 1, 2. [Stewart and Sun| ( [T990l p. 148) define 

k^\\Ah\\{A,+E,)-%. 

From the triangle inequality and from the Neumann series expansion for {Ai + Ei)^^, 

I ll^r'lb - \\{A,+E,)-'\\2 I < \\A^' - (A, + E,)-'\\2 < 0{\\AA\\l) . 

These last two equations combine to 

k^\\Aha-l + 0{\\AA\\l). (5.2) 

The final step appUes a bound that requires some further hypotheses. For any matrix 
M, similar to Pm = MM\ let Rm = Pm* = {M^ Mf be the orthogonal projection 
into the row space of M (viewing the rows as column vectors). Since — [B^ B)^^ 
is a continuous function of A^, therefore both BB^ — AA^ and — A"^ A converge to 
as A A approaches 0|^ If ||Ay4|j2 is sufficiently small that both \\Pa ^ Pb\\2 < 1 and 
||-R^-^b||2 < 1, then it can be shown (iStewart and Sun||1990| p. 153, eqn. 4.1) 



\PA-PBh < 



k\\E. 



2112/ 



[l + {k\\E42/\\Ahrf' 

Altogether, combining equations ( |5.1f[5.3| l leaves 



|Ar||2 < 



l^'ll: 



\\AAh + 0{\\AA\\l), 



(5.3) 



(5.4) 



which is the bound from which the condition estimate in table 15. H is taken. 



The estimate for Xr (^) in equat ion ( 5 .4 1 can be obtained from theorem 4. 1 by increasing 
the second term in the upper bound (4. 13 1 by the factor V^, 




Consequently, Stewart and Sun's value can overestimate the upper bound for Xr{A) by as 
much as V depending on circumstances. The worst situation is illustrated by the example of 
section : 



33] with = and /? > 1/a, 




^ +/32cos2(^) + ^sin^ff/)) 



5.3. The estimate of Golub and Van Loan and of Higham. The condition estimate of 



Golub and Van Loan,(1996,. p. 242, eqn. 5.3.9) and of .Higham, ( ,2002, p. 382, eqn. 20.2) is for 
the choices AA ^ and A6 ^ with the scale factors A = \\A\\2 and TZ = B = \\b\\2. They 
take the approach of equation ( |2.4| i that uses a single quantity, e, to measure the perturbations 
to A and b. 



\AA\\2 < e\\A\\2 
IIA6II2 < £||&||2 



equivalently 



\AA\\ 



|A6||. 



(5.5) 



Specific bounds on tlie norms of these differences can be derived from Wedin | jl973| p. 221,thm. 4.1). 



LINEAR LEAST SQUARES RESIDUAL 



13 



Since B — TZ, this approach can transform the bound (4.4 1 as follows, 

lA&ll 



l^^ll^<||J.[vec(^)].'-^ll^^ll^ 



B 



< 



\Jr[vec{A) 



B A 

11^ 
II g 



|J,(6)|| 



B 



oie) 



Mm 



e + o(e) 



[xAA) + xr{b)]e + oie) . 



(5.6) 



From theorem 4.1 with the choices TZ = B = \\b\\2 

Xr{A)+Xr{b)< ' ''"^''^ 



< 




(5.7) 



Golub and Van Loan and Higham state a larger value, + 1 {3 

Since these formulas 

can be derived from the sum Xr{A) + Xr{b) they are not joint condition numbers in the 



sense of equation (2.4i. Moreover, the derivation inserts V into equation (5.7i, so the result 



can overestimate the sum by as much as a factor of Close to the worst situation for the 
specific value 2/«2 + 1 of Golub, Van Loan and Higham is again illustrated by the example 
of section 3.3 with = and /? ^ 1/a, 




2 

a 



1 



+ 13^} +1 



1 

a 



Note the bound (5.7 1 is not sensitive to the angle 6 between r and col{A) because of the 



choice for the scale factor TZ 
6. Discussion. 



6II2. Choices for TZ are discussed in section 6.1 



6.1. Measuring perturbations to r relative to b. As mentioned in section 4. 1 scaled 
changes to the residual are typically measured by choosing TZ ~ \\r\\2 or ||5||2. The two cases 
are contrasted in Table 6.1 The choice TZ = \\b\\2 makes it appear that 6 is not a source of 
ill-conditioning because the sensitivity of r to A is masked by measuring changes to r against 
the always larger vector b. The choice TZ = \\r\\2 measures perturbations relatively. The next 



section 6.2 describes a situation when the relative measure is appropriate. 



6.2. Significance for iterative methods. Many iterative methods proceed by building 
orthogonal bases from the residuals of least squares projections. For example, for a symmetric 
matrix A and a unit vector t^i, the Lanczos iteration 



Av, 



produces a sequence of orthonormal vectors vi, V2, W3, This algorithm can be viewed as 

repeatedly evaluating a residual rj+i = /3j+iVj^i for either of two orthogonal projections: 



'''Their value resembles the \/2 + 1 that was originally stated by Bjorck 1967 p. 16, eqn. 7.7) 
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Table 6.1 

Effect of scaling on condition numbers for the least squares residual. The full rank least squares problem is 
minx \\b — Ax\\2, the solution is x, the residual is r, = ||^||2/<''min the spectral matrix condition number of 
A o"min is the smallest singular value of A,y = \\ Ax 1 1 2 / ( 1 1 a; 1 1 2 cTmin ) is van der Sluis 's ratio between 1 and , 
and is the angle between b and col (A). 



norms and scale factors 


condition numbers 


||A4||2 


l|Ab|l2 


l|Ar||2 


tight estimate for Xr [A) 


Xr{b) 


A 


B 


7^ 


A=\\A\\2 


e = II6II2 


71= ||r||2 


1 /cot(6»)V 


csc{e) 


A=\\A\\2 


Z3= II6II2 


7^= 11&1I2 


.,^sin^W+(-f))' 


1 



(1) the projection of Avj into the span of Vj^i and vj, or (2) the orthogonal projection of 
A^Vi into the Krylov subspace spanned by vi, Avi, . . ., A^^^vi. 

The appropriate scale factor for measuring perturbations to rj+i is 7?. = ||rj+i||2. The 
relative error in rj^i becomes the absolute error in the normalized vector, f j+i, which con- 
tinues the Lanczos iteration. In ideal circumstances the vectors vj^i and Vj are close to 
orthonormal. If A ~ [vj^i,Vj] is an orthonormal matrix, then K2 — V = 1 so the tight 



estimate in Table 6.1 simplifies to Xri^) ~ csc(0) = Xr{b) where 9 is the angle between 



b = Avj and col(A). Thus r^+i is ill-conditioned when 9 is small. 

6.3. Condition numbers of orthogonal projections. In the least squares problem, the 
condition number of the orthogonal projection Ax is essentially that of the residual. In addi- 



tion to equations (4.1 4.2 1, it is necessary to specify the scaled norm for the projection: 



l|A(^a;)|l - ^^^^ where V = \\Ax\\2 . 

From Ax — Pb follows Jax (b) = P hence XAx (b) = B/V. Since Ax — b — r so Jax {A) = 
—Jr{A). With V replacing TZ in the formulas, the condition numbers of the orthogonal 
projection are 

XAx{b) = sec{e), 



Both K2 and 9 are independent sources of ill-conditioning. 

6.4. Column transformations. The linear least squares residual is invariant with re- 
spect to transformations of the matrix columns, so there is reason to seek changes to the 
columns that might reduce the condition number of the residual. If A is replaced by AM 
for some nonsingular matrix AI that makes AM an orthonormal matrix, then with the scale 



factors of equation (4.2 1 it has been noted in section 6.2 that the tight estimate is Xr{AM) w 
csc{9) which leaves only as a source of ill-conditioning. 

A less costly transformation is M — D for a diagonal matrix D. Two reasons suggest 
choosing D to equilibrate the columns of AD. First, least squares problems typically are 



LINEAR LEAST SQUARES RESIDUAL 



15 



solved using the QR factorization. The errors of that calculation can be accounted for by 
backward rounding errors whose relative size in each column is roughly the same across all 



columns (Higham 2002 p. 385, thm. 20.3). Second, equilibrating the columns is approx- 



imately the optimal column scaling to reduce the matrix condition number (van der Sluis 
|1969| l. Nevertheless, even if Kg (AD) < n^i-^)^ the scaling also alters van der Sluis's ratio 
in equation ( |3.6[ ), so it is unclear whether the net change to the condition number in equation 
([TTJ is for the better. 

Acknowledgements. I thank the editor Prof. D. O'Leary and the three referees for cor- 
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